Deep Audio-visual Learning: A Survey

نویسندگان

چکیده

Abstract Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend leverage these two modalities improve performance of previously considered single-modality tasks or address new challenging problems. In this paper, we provide a comprehensive survey recent audio-visual development. We divide current into four different subfields: separation localization, correspondence generation, representation learning. State-of-the-art methods, as well remaining challenges each subfield, are further discussed. Finally, summarize commonly datasets challenges.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach

Automatic speech recognition (ASR) on video data naturally has access to two modalities: audio and video. In previous work, audio-visual ASR, which leverages visual features to help ASR, has been explored on restricted domains of videos. This paper aims to extend this idea to open-domain videos, for example videos uploaded to YouTube. We achieve this by adopting a unified deep learning approach...

متن کامل

Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. Specifically, we show that we can leverage speech data to fine-tune the network trained for video recognition, given an initial set of audio-video parallel dataset within the same semantics. Our approach first learns the analogyp...

متن کامل

Comparing the Impact of Audio-Visual Input Enhancement on Collocation Learning in Traditional and Mobile Learning Contexts

: This study investigated the impact of audio-visual input enhancement teaching techniques on improving English as Foreign Language (EFL) learnersˈ collocation learning as well as their accuracy concerning collocation use in narrative writing. In addition, it compared the impact and efficiency of audio-visual input enhancement in two learning contexts, namely traditional and mo...

متن کامل

Deep Visual Domain Adaptation: A Survey

Deep domain adaption has emerged as a new learning technique to address the lack of massive amounts of labeled data. Compared to conventional methods, which learn shared feature subspaces or reuse important source instances with shallow representations, deep domain adaption methods leverage deep networks to learn more transferable representations by embedding domain adaptation in the pipeline o...

متن کامل

Perceptual audio loss function for deep learning

PESQ, Perceptual Evaluation of Speech Quality [5], and POLQA, Perceptual Objective Listening Quality Assessment [1] , are standards comprising a test methodology for automated assessment of voice quality of speech as experienced by human beings. The predictions of those objective measures should come as close as possible to subjective quality scores as obtained in subjective listening tests, us...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ژورنال

عنوان ژورنال: International Journal of Automation and Computing

سال: 2021

ISSN: ['1751-8520', '1476-8186']

DOI: https://doi.org/10.1007/s11633-021-1293-0